
WORKING  PAPER 
ALFRED  P.  SLOAN  SCHOOL  OF  MANAGEMENT 


OPTIMIZATION  OF    DISTRIBUTED   DATABASE    SYSTEMS 
AND   COMPUTER  NETWORKS 

Jacob  Akoka   and    Peter    P-S    Chen 


WP  916-77 


March  1977 


MASSACHUSETTS 

INSTITUTE  OF  TECHNOLOGY 

50  MEMORIAL  DRIVE 

CAMBRIDGE,  MASSACHUSETTS  02139 


Abstract


In this paper, a model is developed for the optimization of
distributed database systems and computer networks. Compared with
previous work in this area, the model is more complete since it
considers simultaneously the distribution of computation power, the
allocation of programs and databases, and the assignment of
communication lines. In addition, we have developed a "bounded branch
and bound" algorithm for the model. The algorithm is more efficient
than most of the existing general nonlinear integer programming
algorithms and avoids the disadvantages of the heuristic algorithms
that have been widely used in the optimization of computer networks.
The algorithm has been implemented in FORTRAN.

There  is  no  assumption  on  a  prefixed  network  topology.  The 
optimization  procedure  searches  for  a  system  configuration  satisfying 
various  constraints.   The  model  developed  in  the  paper  can  be  used 
in  the  design  of  distributed  databases  and  computer  networks.   It  can 
also  be  used  to  help  managers  to  decide  whether  to  centralize  or 
decentralize  their  information  systems. 
I. INTRODUCTION

The purpose of this paper is to provide a general framework for the
optimization of distributed database systems and computer networks. A
general distributed system is regarded as:

1) A collection of computers of different capacities, located at
various nodes of a network.

2) A set of communication lines interconnecting the computers of
the network.

3) One or several databases attached to the nodes of the network.

Each of the computer systems possesses facilities for managing the data
(storing, retrieving, or processing the databases). Every user of the
network is able to communicate with all other users and may obtain
information stored at any database of the network.

In  previous  works,  most  of  the  research  focused  on  the 
problem  of  minimizing  the  operating  cost  of  a  distributed  database. 
A  great  deal  of  attention  has  been  devoted  to  the  problem  of  optimal 
distribution  of  the  files  over  a  network  of  computer  systems. 

One  of  the  earliest  studies  of  the  file  allocation  problem  was 
done  by  Chu  (9).   He  developed  a  linear-programming  model  allocating 
files  so  that  the  allocation  yields  minimum  overall  operating  costs 
subject  to  the  following  constraints:  (i)  the  expected  time  to 
access  each  file  is  less  than  a  given  bound,  (ii)  the  amount  of 
storage  needed  at  each  computer  does  not  exceed  the  available  storage 
capacity.   His  model  includes  storage  costs,  queuing  delays,  and 
communication  costs.   But  he  assumed  that  the  number  of  copies  of  each 
file  in  the  system  is  known.   In  a  later  paper,  Chu (10)  developed  a 
procedure  to  determine  in  advance  how  many  redundant  copies  of  a 
file  are  required  to  achieve  a  desired  level  of  reliability.   Then, 
he  inserted  this  number  into  the  model,  and  the  basic  scheme  remains 
unchanged. 

Whitney (Ij4) also formulated a similar model. He applied it to
the design of a network topology and to the allocation of file copies,
and developed a communication network optimization procedure. He
showed that, for certain communication cost functions, the tree
topology is less expensive than any non-tree topology. In addition, he
showed that the system delay is minimized when there are as few
independent channels as possible.

Casey (4) developed a procedure for finding a minimal cost
solution. Heuristic methods are used in his paper to find "good"
solutions. The main difference between his paper and Chu's paper (10)
is that the number of copies of files and their locations are treated
as variables. He showed that the proportion of update traffic to
query traffic generated by the users of a given file in the network
could be used to determine an upper bound on the number of copies of
the file present in the least cost network. He applied his algorithm
to real data for the ARPA network and thus showed the process to be
feasible for networks of moderate size. He indicated that when update
traffic equals query traffic, it is efficient to store all files at a
central node.

Recently, Levin and Morgan (17) (20) developed models that allow
dependencies between files and programs. In another paper (ifj) they
developed a dynamic model for the multi-period case. In this model,
the access requests are assumed to be known for the next T
periods. However, the assumption that the access request patterns
are static over time was relaxed and a dynamic model which considers
transition costs was suggested.

In a recent paper (19), Levin and Morgan provided a framework for
research  in  optimizing  distributed  databases.   They  developed  three 
models  related  to  static  file  assignment  with  complete  information, 
dynamic  file  assignment  with  complete  information,  and  file  assignment 
with  incomplete  information. 

In a recent study, Chu (11) developed several models to study the
performance of file directory systems operating in star and
distributed network topologies. He studied the cost-performance
tradeoffs of three classes of directory systems. Assuming that
the transmission cost is much higher than the storage cost, he showed
that for low directory update rates (less than 10% of the query rate),
the distributed file directory yields a lower operating cost than the
centralized directory system.

Particular attention should be paid to Casey's paper (5)
dealing with the design of tree networks for distributed data. He
formulated a model for locating information resources and choosing a
topology for a network of distributed data files. In this model, he
retains features such as discrete capacity assignment, economy of
scale, and the distinction between query and update transactions. He
developed a heuristic method and formulated an algorithm for solving
the problem. The algorithm was tested for the special case of tree
design.

All the studies mentioned above assume that at each node of the
network, a computer of unlimited capacity is available. Most of the
models also assume a fixed network topology. Streeter (25) relaxed
the latter assumption but did not take into account the problem
of file allocation. In addition, he assumed a fully connected
network.


Modiano (21) extended Casey's model (5) but he did not assume
dependency between files and programs. In addition, set-up cost
was not taken into account, and no operational solution was provided.
Recently, Chang (6) developed a model for distributed computer system
design. It attempted to encompass both the hardware viewpoint and
the software viewpoint.

In this paper we attempt to study simultaneously the distribution
of computation power, the allocation of databases, and the assignment
of communication lines. We shall determine where computers with
different capacities should be located, where and how many databases
should be allocated, and which communication lines to assign.

Our procedure is not based on a fixed network topology. We
assume that there are dependencies between files and programs (the case
of heterogeneous computer networks such as the ARPA Network). The
reason is that while data files can be transferred from one computer
to another, programs written and compiled under the supervision of
one operating system cannot be executed on a different computer.
We also include set-up costs in the model.

Our  approach  is  different  from  the  previous  approaches 
described  above  in  the  following  sense: 

(a) The Model Developed is More Complete

Our model includes computation power allocation, database
allocation, link capacity assignment, message routing and program
sharing. The pricing schemes, network topology, size of computing
hardware, communication line capacity, and database allocation are
interrelated in an optimal design. The model also includes the return
flow of information.

(b)  The  Solution  Procedure  is  Operational 

Most of the papers cited above lack an in-depth discussion of the
methods used to solve the models. Most of these methods are heuristic,
without any proof of convergence. Usually, no indication is given of
how operational these methods are. Finally, these heuristics may be
misleading with respect to the optimal solution. In order to avoid
these shortcomings and to solve our model, which is a large-scale
integer nonlinear programming problem, we developed instead a
"Bounded Branch and Bound" method. Its main characteristics are:

It is a pure mathematical programming algorithm and not a
heuristic method.

The number of nodes of the arborescence that have to be
stored in the computer is at most equal to n, where n is
the number of variables of the problem. Therefore, we can
solve very large scale problems.

It can solve problems where the objective function and the
constraints are nonlinear, without any assumption about the
convexity of the problem. Therefore, we can solve real-life
problems in which set-up costs are included.

It shares with heuristic methods the property that a solution
can be obtained rapidly, within a very reasonable CPU time.
II. THE MODEL


In  this  section  we  consider  the  optimization  aspect  of  distributed 
systems.  A  typical  distributed  database  system  and  computer  network  is 
indicated  in  Figure  1. 
FIGURE 1: A typical distributed database system and computer network


It  should  be  noted  that  for  a  distributed  system  (like  the  one 
depicted  in  Figure  1): 

(i)   The  computers  at  the  nodes  may  be  of  different  capacities. 

(ii)   The  databases  may  be  identical  or  different  (each  one  handling 
information  related  to  warehouses,  plants,  or  personnel),  and 
they  may  be  of  different  lengths. 

(iii)  The communication lines may be of different capacities.

(iv)  The  programs  operating  on  the  databases  are  specific  to  each 
computer  system  and  may  be  of  different  lengths. 


II.1 Modeling Aspects

In  this  section,  we  consider  a  computer  network.   A  computer 
network  is  regarded  as  a  collection  of  computers  of  different  capacities, 
located  at  various  nodes,  interconnected  by  communication  lines  of 
various  capacities. 

A node is a point $(a_i, b_i)$ in the plane. The set of nodes
is denoted by $I$. At each node, a computer may be installed. The
computer can be a mini-computer or any other computer system. The set
of computers is denoted by $M$. The capacity of computer $m$ is denoted
by $k_m$. The capacity of a computer may be defined by its throughput,
its main storage capacity, or other parametric representations. The
cost of a computer $m$ is denoted by $C_m$, a positive number.

At each node of the network, one or several databases may
be installed. We do not consider traditional files; rather, we consider
databases that include all the data used by a specific node,
which may be a manufacturing unit of a large geographically dispersed
company. The set of databases is denoted by $N$. The length of database
$n$ is denoted by $l_n$. The unit of length used in this paper is one
thousand bytes. Two costs are associated with each database: (a) the
set-up cost, which is denoted by $D_n$ for database $n$; (b) the storage
cost, which is denoted by $C_{in}$, the storage cost per thousand bytes
of database $n$ at node $i$.

Since we may have to consider a heterogeneous distributed
database system, we should consider a set of programs devoted to the
use of the databases. The set of programs is denoted by $P$. The
length of program $p$ is $L_p$, and the cost of storage per thousand
bytes of program $p$ at node $i$ is denoted by $S_{ip}$.

The computers of the network and their associated databases
are interconnected by communication lines. The set of communication
lines is denoted by $C$. The capacity of communication line $c$ is
$Q_c$. It may be represented by the line speed or the buffer capacity.
Three different costs are considered: (a) $B_c$ is the cost of installing
a line of capacity $Q_c$; (b) $Q_{ij}$ represents the communication cost
per query unit from $i$ to $j$; (c) $U_{ij}$ represents the communication
cost per update unit from $i$ to $j$.

An optimal distributed information system is defined when:

(i)   an optimal allocation of computers over the network is specified.
We emphasize that we do not consider the case in which
the topology of the network is given.

(ii)   an optimal allocation of the databases is obtained. We may have
one or more databases at a specific node.

(iii)  an optimal allocation of the programs is defined.

(iv)   an optimal allocation of the communication lines between the
nodes is specified.

Our  problem  is  to  determine  these  optimal  allocations  at  minimum  cost. 
The  costs  considered  in  our  model  are: 

cost  of  computers  (equipment  cost) 

cost  of  databases  (set-up  and  operating  costs) 

cost  of  communication  lines 

storage  costs  of  databases 

storage  costs  of  programs 

communication  cost  of  queries  and  updates  from  nodes  to 
programs 

communication  costs  of  queries  and  updates  from  programs  to 
database 

All the costs considered in our model are costs per month. The
choice of time period is open to argument, but a month gives a good
estimate of the costs. The monthly cost of a given piece of equipment
can be obtained by dividing its purchase cost by its lifetime
expressed in months. Of course, other time periods are relevant and
can be applied in this model.
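
As a small illustration of this costing convention, the sketch below converts a purchase price into a monthly cost by straight-line division over the equipment lifetime. The function name and the figures are hypothetical and are not taken from the paper.

```python
def monthly_cost(purchase_cost: float, lifetime_months: int) -> float:
    """Straight-line monthly cost: purchase cost divided by lifetime in months."""
    if lifetime_months <= 0:
        raise ValueError("lifetime_months must be positive")
    return purchase_cost / lifetime_months


# Hypothetical example: equipment bought for 120,000 and kept for 60 months.
example_monthly_cost = monthly_cost(120_000, 60)
```

Any other period (week, year) can be used in the same way, provided every cost in the model is expressed over the same period.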

II.2 The Nomenclature

Let us now list the notation used in our model:

$(a_i, b_i)$    location of node $i$ in the plane
$I$             set of nodes
$M$             set of computers
$k_m$           capacity of computer $m$
$C_m$           cost of computer $m$
$N$             set of databases
$l_n$           length of database $n$
$D_n$           set-up cost of database $n$
$C_{in}$        storage cost of database $n$ at node $i$
$P$             set of programs
$L_p$           length of program $p$
$S_{ip}$        storage cost of program $p$ at node $i$
$C$             set of communication lines
$Q_c$           capacity of communication line $c$
$B_c$           installation cost of a communication line of capacity $Q_c$
$Q_{ij}$        communication cost per query unit from $i$ to $j$
$U_{ij}$        communication cost per update unit from $i$ to $j$
$D_{ij}$        matrix of distances between $i$ and $j$
$Q_{ip}^d$      query traffic from node $i$ to database $d$ via program $p$
$U_{ip}^d$      update traffic from node $i$ to database $d$ via program $p$

All the decision variables are binary and are specified in Section
II.3.

II.3 The Objective Function

Let's  now  detail  every  cost  and  exhibit  the  objective 
function  and  the  constraints  of  our  model. 

(a)   Computers  Cost 

Computers cost is assumed to be a function of their capacities.
The capacity of a computer may be expressed in terms of throughput,
main storage capacity, or other parameters.

Let us define

\[
y_i^m =
\begin{cases}
1 & \text{if a computer of capacity } k_m \text{ is allocated to node } i \\
0 & \text{otherwise}
\end{cases}
\]

$C_m$ = cost per month of computer $m$, whose capacity is $k_m$.

The total cost of computer equipment is:

\[
\text{COST 1} = \sum_{i,m} C_m \, y_i^m, \qquad i \in I,\ m \in M
\]

(b)   Databases  cost 

Databases cost may be estimated by the set-up and
operating costs. It is a function of the length of the databases.

Let

\[
X_i^n =
\begin{cases}
1 & \text{if a database of length } l_n \text{ is allocated to node } i \\
0 & \text{otherwise}
\end{cases}
\]

$D_n$ = total cost of database $n$, whose length is given in thousand
bytes; it is equal to the sum of the set-up cost and the operating cost.

The total cost of databases is

\[
\text{COST 2} = \sum_{i,n} D_n \, X_i^n, \qquad i \in I,\ n \in N
\]


(c)   Communication Lines Installation Cost

Communication lines connecting the nodes of the network
may be of different capacities and different speeds. It seems
reasonable to take the cost of communication lines as a function of the
distance between the nodes and of the capacities, to which we add a
set-up cost. Therefore, the communication lines cost is:

\[
\text{COST 3} = \sum_{i,j,c} \left[ B^1_c + (B_c \, D_{ij}) \right] L_{ij}^c,
\qquad i \in I,\ j \in J = I,\ c \in C
\]

where

$B_c$ = cost of installing a line of capacity $Q_c$ (variable cost)

$B^1_c$ = fixed cost of the communication line

$D_{ij} = D_{ji}$ = distance between nodes $i$ and $j$

\[
L_{ij}^c =
\begin{cases}
1 & \text{if a communication line of capacity } Q_c \text{ connects node } i \text{ to node } j \\
0 & \text{otherwise}
\end{cases}
\]
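
To make the structure of COST 3 concrete, here is a minimal sketch that evaluates it for a given set of installed lines. The function name, the data layout (a set of `(i, j, c)` triples standing for the line variables equal to 1) and the numbers are assumptions made for illustration only.

```python
def line_installation_cost(fixed, variable, dist, lines):
    """COST 3: sum of fixed[c] + variable[c] * dist[i][j] over installed lines.

    fixed[c]    -- fixed cost of a line of capacity class c (hypothetical data)
    variable[c] -- distance-dependent cost of a line of capacity class c
    dist[i][j]  -- distance between nodes i and j
    lines       -- set of (i, j, c) triples for which the line variable is 1
    """
    return sum(fixed[c] + variable[c] * dist[i][j] for (i, j, c) in lines)


# Hypothetical two-node example with two capacity classes.
fixed = [100.0, 250.0]
variable = [2.0, 3.5]
dist = [[0.0, 10.0], [10.0, 0.0]]
cost3 = line_installation_cost(fixed, variable, dist, {(0, 1, 0)})
```

With a single line of class 0 between nodes 0 and 1, the cost is the fixed part 100 plus 2 per unit of distance over 10 units.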

(d)   Storage Cost

We differentiate between the storage cost of databases and
the storage cost of programs. A distinction has to be made between
programs and databases in a heterogeneous distributed computer system
(see (19)).

(d.1)   Storage Cost of Databases

The storage cost of a database is assumed to be linear
in the length of the database, using thousand bytes as a unit.

Let

\[
X_i^n =
\begin{cases}
1 & \text{if a database } n \text{ of length } l_n \text{ exists at node } i \\
0 & \text{otherwise}
\end{cases}
\]

$C_{in}$ = storage cost per month of database $n$, whose length in
thousand bytes is $l_n$, existing at node $i$.

The total storage cost is:

\[
\text{COST 4} = \sum_{i,n} C_{in} \, X_i^n, \qquad i \in I,\ n \in N
\]


(d.2)   Storage Cost of Programs

We make the same assumptions as above, but the storage cost of
programs is considered separately from that of databases.
Therefore, the total storage cost of programs is:

\[
\text{COST 5} = \sum_{i,p} S_{ip} \, Z_i^p, \qquad i \in I,\ p \in P
\]

where:

$S_{ip}$ = storage cost per month of program $p$ of length $L_p$ at
node $i$

\[
Z_i^p =
\begin{cases}
1 & \text{if a copy of program } p \text{ of length } L_p \text{ is stored at node } i \\
0 & \text{otherwise}
\end{cases}
\]
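
The four allocation-related terms (COST 1, COST 2, COST 4 and COST 5) are all linear in the binary variables, so they can be evaluated by summing a coefficient for each variable set to 1. The sketch below does this under an assumed representation in which each family of binary variables is stored as the set of index pairs whose value is 1; the function name, argument names and data are hypothetical.

```python
def allocation_costs(C, D, Cs, S, y, X, Z):
    """Sum of COST 1, COST 2, COST 4 and COST 5 for a given allocation.

    C[m]     -- monthly cost of computer type m
    D[n]     -- set-up plus operating cost of database n
    Cs[i][n] -- monthly storage cost of database n at node i
    S[i][p]  -- monthly storage cost of program p at node i
    y        -- set of (i, m): computer of type m installed at node i
    X        -- set of (i, n): database n allocated to node i
    Z        -- set of (i, p): copy of program p stored at node i
    """
    cost1 = sum(C[m] for (_, m) in y)         # computer equipment
    cost2 = sum(D[n] for (_, n) in X)         # database set-up and operation
    cost4 = sum(Cs[i][n] for (i, n) in X)     # database storage
    cost5 = sum(S[i][p] for (i, p) in Z)      # program storage
    return cost1 + cost2 + cost4 + cost5


# Hypothetical data: two nodes, two computer types, one database, one program.
total_allocation_cost = allocation_costs(
    C=[500.0, 900.0], D=[40.0], Cs=[[3.0], [4.0]], S=[[1.0], [2.0]],
    y={(0, 1)}, X={(0, 0)}, Z={(0, 0)},
)
```

Here a type-1 computer, the database and the program are all placed at node 0, giving 900 + 40 + 3 + 1.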


(e)   Communication  Costs 


For the communication costs, we differentiate between the
following costs:

(a)  communication cost of queries from nodes to programs

(b)  communication cost of updates from nodes to programs

(c)  communication cost of queries from programs to databases

(d)  communication cost of updates from programs to
databases.

(e.1)  Communication Cost of Queries from Nodes to Programs

This communication cost is a function of the query
traffic between nodes. Let us define:

$Q_{ip}^d$ = query traffic from node $i$ to database $d$ via program $p$

$Q_{ij}$ = communication cost per query unit from node $i$ to node $j$

\[
X_{ijp}^d =
\begin{cases}
1 & \text{if transactions from node } i \text{ to database } d \text{ are routed to node } j \text{ via program } p \\
0 & \text{otherwise}
\end{cases}
\]

The total communication cost of queries from nodes to programs is:

\[
\text{COST 6} = \sum_{i,j,p,d} Q_{ip}^d \, Q_{ij} \, X_{ijp}^d,
\qquad i \in I,\ j \in J = I,\ p \in P,\ d \in N
\]

(e.2)   Communication Cost of Updates from Nodes to Programs

We make the same assumptions as above. Therefore, the
total cost of updates from nodes to programs is:

\[
\text{COST 7} = \sum_{i,j,p,d} U_{ip}^d \, U_{ij} \, X_{ijp}^d,
\qquad i \in I,\ j \in J = I,\ p \in P,\ d \in N
\]

where:

$U_{ip}^d$ = update traffic from node $i$ to database $d$ via program $p$

$U_{ij}$ = communication cost per update unit from node $i$ to node $j$

$X_{ijp}^d$ = same as above.

(e.3)  Communication Cost of Queries from Programs to Databases

As in (e.1) and (e.2), we consider the communication
cost of queries from programs to databases as a function of the
query traffic to databases, processed at different nodes.

Let us define:

\[
\lambda_{jd} = \sum_{i,p} Q_{ip}^d \, X_{ijp}^d
\]

the query traffic to database $d$ processed at node $j$ during a month, and

\[
X_{jkd} =
\begin{cases}
1 & \text{if transactions from node } j \text{ to database } d \text{ are routed to node } k \\
0 & \text{otherwise.}
\end{cases}
\]

The total communication cost is therefore:

\[
\text{COST 8} = \sum_{j,k,d} \alpha \, \lambda_{jd} \, Q_{jk} \, X_{jkd} \, (1 + \gamma_{jd}),
\qquad j \in J = I,\ k \in K = I,\ d \in N
\]

where:

$\alpha$ = expansion factor for query message [see (1-^)]

$\gamma_{jd}$ = econometrically estimated ratio of size of response to
size of query for requests from node $j$ to database $d$.
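
Because the traffic processed at node j itself depends on the node-to-program routing variables, COST 8 multiplies two families of decision variables, which is one source of nonlinearity in the objective. The sketch below computes the processed traffic and then COST 8. It assumes, for simplicity, that every transaction has exactly one route (the model only requires at least one); the function names, dictionary-based encoding and data are hypothetical.

```python
def processed_query_traffic(Q, route):
    """Query traffic to database d processed at node j.

    Q[i][p][d]          -- query traffic from node i to database d via program p
    route[(i, p, d)] = j -- the single node j to which that traffic is routed
    Returns a dict {(j, d): traffic}.
    """
    lam = {}
    for (i, p, d), j in route.items():
        lam[(j, d)] = lam.get((j, d), 0.0) + Q[i][p][d]
    return lam


def cost8(lam, unit_cost, db_route, alpha, gamma):
    """COST 8: alpha * traffic * unit cost * (1 + response ratio), summed.

    unit_cost[j][k]      -- communication cost per query unit from j to k
    db_route[(j, d)] = k -- the single node k holding the copy of d used by j
    gamma[(j, d)]        -- ratio of response size to query size
    """
    total = 0.0
    for (j, d), traffic in lam.items():
        k = db_route[(j, d)]
        total += alpha * traffic * unit_cost[j][k] * (1.0 + gamma[(j, d)])
    return total


# Hypothetical data: two nodes, one program, one database.
Q = [[[10.0]], [[5.0]]]                      # Q[i][p][d]
route = {(0, 0, 0): 1, (1, 0, 0): 1}         # both nodes use the program at node 1
lam = processed_query_traffic(Q, route)
query_cost = cost8(lam, [[0.0, 2.0], [2.0, 0.0]], {(1, 0): 0},
                   alpha=1.0, gamma={(1, 0): 0.5})
```

In this example node 1 processes 15 units of query traffic and forwards them to the database at node 0; the factor 1.5 accounts for the return flow of responses.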

(e.4)  Communication Cost of Updates from Programs to
Databases

We make the same assumption as above and we
define $\mu_{jd}$ as the update traffic to database $d$ processed at
node $j$, where

\[
\mu_{jd} = \sum_{i,p} U_{ip}^d \, X_{ijp}^d
\]

Using the variables defined above, the total communication cost of
updates from programs to databases is:

\[
\text{COST 9} = \sum_{j,k,d} \beta \, \mu_{jd} \, U_{jk} \, X_{jkd},
\qquad j \in J = I,\ k \in K = I,\ d \in N
\]

where $\beta$ = expansion factor for update [see QO)], and the inner
sum defining $\mu_{jd}$ runs over $i \in I$ and $p \in P$.


The objective function is obtained by summing all the costs:

\[
\text{COST} = \sum_{i=1}^{9} \text{COST}_i
\]

II.4 The Constraints

A distributed system is feasible if the following
constraints are satisfied:

(a)   Existence of databases, programs, and communication lines

In order to have a feasible solution, there must be at
least one copy of each database and program, and a communication line.

These conditions are met if:

\[
\sum_{i} X_i^n \ge 1, \qquad \forall n \in N
\]

\[
\sum_{i} Z_i^p \ge 1, \qquad \forall p \in P
\]

\[
\sum_{i} L_{ij}^c \ge 1, \qquad \forall j \in J = I \ \text{and} \ c \in C
\]

(b)   Transactions with defined routes

We have to ensure that every transaction to every database,
via every program and from every node, has a defined route.
Therefore:

\[
\sum_{j} X_{ijp}^d \ge 1, \qquad \forall i \in I,\ p \in P,\ d \in N
\]

\[
\sum_{k} X_{jkd} \ge 1, \qquad \forall j \in J = I,\ d \in N
\]

(c)  Residency of databases and programs in accordance with the
defined routes

We must ensure residency of the appropriate databases and
programs in accordance with the defined routes:

\[
\sum_{i} X_{ijp}^d \le |I| \cdot Z_j^p, \qquad \forall j \in J = I,\ p \in P,\ d \in N
\]

\[
\sum_{j} X_{jkd} \le |I| \cdot X_k^d, \qquad \forall k \in K,\ d \in N
\]

(d)  Residency of programs

We must ensure that program $p$ will reside only in a node at
which it can be processed:

\[
Z_j^p = 0, \qquad \forall j \notin I_p,\ p \in P
\]

where $I_p$ is the set of nodes at which a given program $p$ can be
processed. For a homogeneous distributed database system, the set $I_p$
for any given $p$ is equivalent to the set of the network nodes.
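
Constraints (c) and (d) can be read as simple implications: a route may only end at a node where the required program or database copy actually resides, and a program copy may only be stored at a node able to process it. The aggregated form in (c), with the number of nodes as coefficient, enforces the same implication because each sum has at most that many terms, each at most 1. The sketch below checks these implications directly; it assumes one route per transaction, and all names and data are hypothetical.

```python
def routes_are_consistent(route, db_route, Z, X, allowed_nodes):
    """Check the residency constraints (c) and (d) for a candidate design.

    route[(i, p, d)] = j -- node i reaches database d via program p run at node j
    db_route[(j, d)] = k -- node j reaches the copy of database d held at node k
    Z -- set of (node, program) pairs where a program copy is stored
    X -- set of (node, database) pairs where a database copy is stored
    allowed_nodes[p] -- set of nodes at which program p can be processed
    """
    # (c) a route may only end where the program or database resides
    for (_i, p, _d), j in route.items():
        if (j, p) not in Z:
            return False
    for (_j, d), k in db_route.items():
        if (k, d) not in X:
            return False
    # (d) a program copy may only be stored at a node able to process it
    return all(node in allowed_nodes[p] for (node, p) in Z)


# Hypothetical two-node design: program 0 runs at node 1, database 0 sits at node 0.
route = {(0, 0, 0): 1, (1, 0, 0): 1}
db_route = {(1, 0): 0}
design_ok = routes_are_consistent(route, db_route, Z={(1, 0)}, X={(0, 0)},
                                  allowed_nodes={0: {1}})
```

Moving the program copy to node 0 while keeping the routes unchanged would violate (c), and restricting program 0 to node 0 only would violate (d).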
(e)   Computation capacity

We must ensure that the total processing requirements of
all transactions allocated to a given node do not exceed the
computer capacity at this node. Therefore, the total computation
needs satisfied by a computer at a node must be less than or equal to
its capacity:

\[
\sum_{p} Q_{ip}^d + \sum_{p} U_{ip}^d \le \sum_{m \in M} k_m \, y_i^m,
\qquad \text{for all } i \in I,\ d \in N
\]

(f)  Existence of a database with respect to computers

We must ensure that databases may only exist at nodes
where there are computers:

\[
X_i^n \le \sum_{m} y_i^m, \qquad \text{for all } i \in I \text{ and } n \in N
\]
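
A direct way to read constraints (e) and (f) is as two feasibility checks on a candidate design. The sketch below implements them as written above, using the same assumed set-of-pairs encoding for the binary variables; function names and data are hypothetical.

```python
def capacity_ok(Q, U, k, y, nodes, programs, databases):
    """Constraint (e): for each node i and database d, query plus update
    traffic summed over programs must not exceed the installed capacity."""
    for i in nodes:
        installed = sum(k[m] for (node, m) in y if node == i)
        for d in databases:
            load = sum(Q[i][p][d] + U[i][p][d] for p in programs)
            if load > installed:
                return False
    return True


def databases_have_computers(X, y):
    """Constraint (f): a database may only exist at a node with a computer."""
    nodes_with_computer = {i for (i, _m) in y}
    return all(i in nodes_with_computer for (i, _n) in X)


# Hypothetical data: two nodes, one program, one database, two computer sizes.
Q = [[[10.0]], [[5.0]]]      # Q[i][p][d], query traffic
U = [[[2.0]], [[1.0]]]       # U[i][p][d], update traffic
k = [8.0, 20.0]              # capacities of computer types 0 and 1
big_at_0 = {(0, 1), (1, 0)}  # large computer at node 0, small one at node 1
small_everywhere = {(0, 0), (1, 0)}
```

With the large computer at node 0 the load of 12 fits within capacity 20; with small computers everywhere node 0 is overloaded (12 > 8), so that design is infeasible.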

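(g)  Binary constraints

All the decision variables must be binary variables.
Therefore:

\[
\begin{aligned}
y_i^m &\in \{0, 1\} && \text{for all } i \in I \text{ and } m \in M \\
X_i^n &\in \{0, 1\} && \text{for all } i \in I \text{ and } n \in N \\
L_{ij}^c &\in \{0, 1\} && \text{for all } i \in I,\ j \in J = I,\ c \in C \\
Z_i^p &\in \{0, 1\} && \text{for all } i \in I,\ p \in P \\
X_{ijp}^d &\in \{0, 1\} && \text{for all } i \in I,\ j \in J = I,\ p \in P,\ d \in N \\
X_{jkd} &\in \{0, 1\} && \text{for all } j \in J = I,\ k \in K,\ d \in N
\end{aligned}
\]

II.5 Final Formulation of the Problem

Based on the costs discussed in Section II.3 and the constraints
discussed in Section II.4, we can formulate the problem as follows: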

\[
\begin{aligned}
\min \Big\{\ & \sum_{i,m} C_m \, y_i^m + \sum_{i,n} D_n \, X_i^n
  + \sum_{i,j,c} \left( B^1_c + (B_c \, D_{ij}) \right) L_{ij}^c \\
& + \sum_{i,n} C_{in} \, X_i^n + \sum_{i,p} S_{ip} \, Z_i^p
  + \sum_{i,j,p,d} Q_{ip}^d \, Q_{ij} \, X_{ijp}^d \\
& + \sum_{i,j,p,d} U_{ip}^d \, U_{ij} \, X_{ijp}^d
  + \sum_{j,k,d} \alpha \, \lambda_{jd} \, Q_{jk} \, X_{jkd} \, (1 + \gamma_{jd}) \\
& + \sum_{j,k,d} \beta \, \mu_{jd} \, U_{jk} \, X_{jkd} \ \Big\}
\end{aligned}
\]

subject to:

\[
\begin{aligned}
& \sum_{i} X_i^n \ge 1, && \forall n \in N \\
& \sum_{i} Z_i^p \ge 1, && \forall p \in P \\
& \sum_{i} L_{ij}^c \ge 1, && \forall j \in J = I,\ \forall c \in C \\
& \sum_{j} X_{ijp}^d \ge 1, && \forall i \in I,\ \forall p \in P,\ \forall d \in N \\
& \sum_{k} X_{jkd} \ge 1, && \forall j \in J = I,\ \forall d \in N \\
& \sum_{i} X_{ijp}^d \le |I| \cdot Z_j^p, && \forall j \in J = I,\ \forall p \in P,\ \forall d \in N \\
& \sum_{j} X_{jkd} \le |I| \cdot X_k^d, && \forall k \in K,\ \forall d \in N \\
& Z_j^p = 0, && \forall j \notin I_p,\ \forall p \in P \\
& \sum_{p} Q_{ip}^d + \sum_{p} U_{ip}^d \le \sum_{m} k_m \, y_i^m, && \forall i \in I,\ \forall d \in N \\
& X_i^n \le \sum_{m} y_i^m, && \forall i \in I,\ \forall n \in N \\
& y_i^m,\ X_i^n,\ L_{ij}^c,\ Z_i^p,\ X_{ijp}^d,\ X_{jkd} && \text{binary variables.}
\end{aligned}
\]
III. SOLUTION OF THE PROBLEM

III.1 Introduction

Most of the authors cited in the introduction of this
paper used heuristic approaches to solve the large-scale programming
problems obtained in modeling distributed systems. But these methods
may not lead to an optimal solution. In some cases, the solution may
be far from the optimal value. Besides, such methods may be misleading.
Since the objective function in our model is nonlinear and since all
the variables are integers, we have to solve an integer nonlinear
programming problem.

Although the numbers of variables and constraints are
very large, we have developed an algorithm for large-scale
integer nonlinear programming problems. A FORTRAN code of the
algorithm was developed and tested on several problems. The results
obtained by the computer code are very encouraging [see (2)]. The
main advantages of the algorithm are the following:

It converges more rapidly than the traditional branch and bound
algorithms (see (2)).

In most of the traditional branch and bound methods, the number
of nodes of the arborescence that have to be stored in the
computer grows very rapidly and is uncontrollable. In this
method, the number of nodes of the arborescence that have to
be stored is at most equal to n, where n is the number of
variables of the problem. Therefore, it needs little storage
space.

It may be applied to the case where the objective function and
the constraints are nonlinear, without any assumption about
the convexity of the problem (i.e., it can solve non-convex
problems). If the user wants to stop the optimization
procedure when a given amount of CPU time has elapsed, he can
still obtain a "reasonable" solution.

This method is called the "Bounded Branch and Bound" method.

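III.2 Brief Outline of the Bounded Branch and Bound Method

Let $\phi(x)$ be the objective function and $h_l(x)$ the constraints.
The formulation of the problem is then:

\[
\begin{aligned}
\min\ & \phi(x), \quad x \in R^n \\
\text{s.t.}\ & h_l(x) \le 0, && l = 1, 2, \ldots, m && \text{(c-1)} \\
& a_j \le x_j \le b_j, && j \in J = \{1, 2, \ldots, n\} && \text{(c-2)} \\
& x_j \ \text{integer}, && j \in E \subseteq J && \text{(c-3)}
\end{aligned}
\]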

We assume that:

$\phi(x)$ and $h_l(x)$ are nonlinear, continuously differentiable
functions.

Constraints (c-1) define a domain which may be convex or
non-convex.

Constraints (c-2) define a parallelotope $\pi$. $\pi$ is assumed to
be bounded. In this constraint $a_j = 0$ and $b_j = 1$, $\forall j \in J$.

Constraint (c-3) indicates that if $E = J$, all the variables
are binary. We then have to solve a "pure integer nonlinear
programming" problem, which is our case. If $E \ne J$, we have
to solve a "mixed integer nonlinear programming" problem.

Let  s  =  {x  e  R^j  x  satisfies  (c-1)  and  (c-2)}. 

(a)  First we solve the following problem:

$$ \min\ \phi(x) \quad \text{s.t.} \quad x \in S $$

Let $x^0$ be the optimal solution of this problem.


(b)  If $x^0$ is binary: END.

(c)  Otherwise, $\exists\, j_1 \in E$ such that $x^0_{j_1}$ is not integer. We
proceed to the separation of $S$ into two subsets $S_1$ and $S_2$ such that:

$$ S_1 = \{x \in R^n \mid x \text{ satisfies (c-1) and (c-2) and } a_{j_1} \le x_{j_1} \le [x^0_{j_1}]\} $$

$$ S_2 = \{x \in R^n \mid x \text{ satisfies (c-1) and (c-2) and } [x^0_{j_1}] + 1 \le x_{j_1} \le b_{j_1}\} $$

($[x_{j_1}]$ denotes the integer part of $x_{j_1}$.)


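To $S_1$ we associate the following problem:

$$ P_1:\quad \min\ \phi(x) \quad \text{s.t.} \quad x \in S_1 $$

To $S_2$ we associate the following problem:

$$ P_2:\quad \min\ \phi(x) \quad \text{s.t.} \quad x \in S_2 $$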


Then  we  solve  one  of  the  problems  and  put  the  other  one  in  a 
waiting  list. 

(d)  If all the $x_j$, $j \in E$, in the solution of the problem just
solved are binary, go to (e). Otherwise go to (c).

(e)  If this solution is better than the one previously stored, store
it. Otherwise reject the corresponding node. If there is a problem
in the waiting list, solve it and go to (d). Otherwise: END.
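
Steps (a) to (e) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' FORTRAN implementation. The function name `depth_first_branch_and_bound`, the callback `solve_relaxation` and the toy objective are all hypothetical names introduced here. The callback is assumed to return a global optimum of the continuous subproblem, or `None` when the subset is empty. The sketch also rejects a node as soon as its relaxed value cannot improve on the stored solution; this is a standard bounding test that the outline above does not spell out.

```python
import math

def depth_first_branch_and_bound(solve_relaxation, lower, upper, int_idx, tol=1e-9):
    """Sketch of steps (a)-(e): solve one subproblem, keep its sibling waiting."""
    best_x, best_val = None, math.inf
    waiting = []                                  # waiting list of postponed subproblems
    max_pending = 0                               # largest size reached by the waiting list
    current = (list(lower), list(upper))          # step (a): whole domain S
    while current is not None:
        lo, hi = current
        current = None
        result = solve_relaxation(lo, hi)         # hypothetical continuous solver
        if result is not None:
            x, val = result
            if val < best_val:                    # assumed bounding test
                frac = [j for j in int_idx if abs(x[j] - round(x[j])) > tol]
                if frac:                          # step (c): separate on x_j
                    j = frac[0]
                    k = math.floor(x[j])          # integer part [x_j]
                    # S2: [x_j] + 1 <= x_j <= b_j goes to the waiting list
                    waiting.append((lo[:j] + [k + 1] + lo[j + 1:], hi[:]))
                    max_pending = max(max_pending, len(waiting))
                    # S1: a_j <= x_j <= [x_j] is solved next
                    current = (lo[:], hi[:j] + [k] + hi[j + 1:])
                    continue
                best_x, best_val = list(x), val   # step (e): store better solution
        if waiting:                               # step (e): next waiting problem
            current = waiting.pop()
    return best_x, best_val, max_pending

def toy_relaxation(lo, hi, target=(0.3, 0.8, 0.4)):
    """Toy subproblem: min sum (x_j - target_j)^2 over the box lo <= x <= hi."""
    if any(l > h for l, h in zip(lo, hi)):
        return None                               # empty subset
    x = [min(max(t, l), h) for t, l, h in zip(target, lo, hi)]
    return x, sum((xi - t) ** 2 for xi, t in zip(x, target))

best_x, best_val, max_pending = depth_first_branch_and_bound(
    toy_relaxation, [0, 0, 0], [1, 1, 1], int_idx=[0, 1, 2])
```

With binary bounds, $S_1$ fixes the chosen variable at 0 and $S_2$ fixes it at 1. Because the waiting list is processed last-in first-out, it holds at most one postponed sibling per level, which is in line with the storage bound of n nodes stated earlier.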


III.3  The Oriented Graph Associated with the Problem

After applying the principle of separation described above,
we obtain an arborescence. This arborescence may be considered
as an oriented graph. Let us describe this graph and show how to
use it.




Consider a set $A$ such that $A \subseteq E$. Let $x_A = \{x_j \mid j \in A\}$ and
$\bar{x}_A$ the integer components of $x_A$ satisfying (c-2). To $S = (A, \bar{x}_A)$ we
associate the following problem:


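$$
P(S):\quad
\begin{aligned}
\min\ & \phi(x) \\
& h_i(x) \le 0, \quad i \in I \\
& x_j = \bar{x}_j, \quad j \in A \\
& a_j \le x_j \le b_j, \quad j \in J
\end{aligned}
$$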


Let $x_S$ and $\phi_S$ be the optimal solution and the optimal value of $P(S)$.

(a)  The couple $S = (A, \bar{x}_A)$ represents a node of the graph G.

(b)  To each node $S$ we associate its "level" in the graph, called
$t_S$. $t_S$ is equal to the number of components of $x$ which are
integers.

By applying the principles of separation described in section III.2,
we obtain the following type of arborescence:

[Figure: arborescence of nodes rooted at $S^0$]

where $S^0$ is the root of the arborescence.


Let $E(S) = \{j \in E \mid x_j(S) \text{ is integer}\}$. Therefore, we have the following
relationship: $A \subseteq E(S) \subseteq E$.

When $E(S) = E$, constraint (c-3) is satisfied. Therefore we have
found a solution to the problem and the corresponding node $S$ is
called a "terminal node".

node  Sj^  is  in  the  right  wing  of  the  graph.   Nodes  S^  and  Sj^ 
are  in  the  left  wing  relatively  to  S^^. 

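(c)  Let $S$ be a non-terminal node and $\beta \in E - E(S)$; node $T$ is a
successor of $S$, with a level $t_T = t_S + 1$, where $T$ is defined as
follows:

$$ T = (B, \bar{x}_B) $$

where

$$ B = E(S) \cup \{\beta\} $$

$$ \bar{x}_j = x_j(S), \quad \forall j \in E(S) $$

$$ \bar{x}_\beta = [x_\beta(S)] + \gamma $$

For a left wing, $\gamma = -1, -2, -3, \ldots$

For a right wing, $\gamma = 0, 1, 2, 3, \ldots$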

Note:  When a variable $x_j$ is fixed at a given integer value, it keeps
this same value in the wing of the arborescence.
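
The node definitions above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names `integer_set`, `is_terminal` and `successor` are hypothetical, $[\cdot]$ is taken to be the floor (as in section III.2), and checking the resulting value against the bounds (c-2) is left to the caller.

```python
import math

def integer_set(x, E, tol=1e-9):
    """E(S): indices j in E whose component x[j] is integer (within tol)."""
    return {j for j in E if abs(x[j] - round(x[j])) <= tol}

def is_terminal(x, E, tol=1e-9):
    """A node is terminal when E(S) = E, i.e. constraint (c-3) is satisfied."""
    return integer_set(x, E, tol) == set(E)

def successor(x, E, beta, gamma, tol=1e-9):
    """Fixed values of a successor T = (B, x_B), with B = E(S) U {beta}.

    Assumption: [.] is the floor. Bounds (c-2) are not checked here.
    """
    e_s = integer_set(x, E, tol)
    if beta not in E or beta in e_s:
        raise ValueError("beta must belong to E - E(S)")
    fixed = {j: round(x[j]) for j in e_s}         # x_j = x_j(S) for j in E(S)
    fixed[beta] = math.floor(x[beta]) + gamma     # x_beta = [x_beta(S)] + gamma
    return fixed

# Hypothetical node solution: x_1 = 0.4 is the only non-integer component.
x_S = [1.0, 0.4, 0.0]
E = {0, 1, 2}
T0 = successor(x_S, E, beta=1, gamma=0)           # fixes x_1 at 0
T1 = successor(x_S, E, beta=1, gamma=1)           # fixes x_1 at 1
```

Each successor fixes exactly one more variable than $E(S)$, and the values already fixed are carried over unchanged, as stated in the note above.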

A more detailed version of this algorithm is given in (2).

A FORTRAN code for this algorithm was written and tested on
several problems.

IV  SUMMARY, POSSIBLE APPLICATIONS AND FUTURE RESEARCH DIRECTIONS

IV.1  Summary

In this paper, we developed a detailed model which includes
computation power allocation, database and communication line
assignment, message routing and program sharing. It takes into account
set-up costs and the return flow. An operational solution procedure
was described. It is based on a mathematical programming algorithm
which is able to solve mixed integer nonlinear programming problems with
or without assumptions about the convexity of the objective function
and constraints.

IV.2  Possible Applications

(a)  Application to the Design of Computer Networks

The model developed in this paper can be applied to
the design of a computer network. It will allow the designer to
define the network topology and to assign communication line
capacities. The designer can use it as a technique to produce minimal
cost designs taking into consideration the economies of scale which
exist for real computation facilities.

The model can also be used by users to determine
whether to join a computer network or to operate their own computer
facilities.

(b)  Application  to  the  Distributed  Databases  Problem 

This model can be viewed as an extension of the
Levin-Morgan (19) model. It incorporates set-up costs and the
return flow of information. If one views a single computer as a
special case of a computer network, our model can be interpreted
as a combination of Chen's work (7) with various network flow
problems. Besides, the model can be used to minimize the operating
cost of distributed databases shared by a community of users
interconnected through a computer network. In particular, it can be used
to find the optimal database/program locations in a computer network.

(c)  Application  to  the  Centralization  versus 
Decentralization  Issue 


Until recently, the management-oriented literature
focused only on the respective advantages and disadvantages of
centralized and decentralized systems (15). One comprehensive model,
which includes guidance for managers dealing with the
centralization-decentralization issue, was developed by Rockart et al. (22). One
important aspect of this model is how to determine the most effective
range of configurations.

One of the first quantitative approaches to this
question was taken by Streeter (23). His model was generalized by
Chen et al. (8). One shortcoming of these models is that they are
unconstrained and do not take into account many important factors.
Our model can therefore be viewed as a generalization of the Streeter
and Chen models. Very few simplifications are needed to derive from our
model the formulations of centralized and decentralized systems.
Since these are subsystems of the distributed system, by solving
our model we can determine a range of configurations from fully
centralized to fully decentralized systems. Therefore, the model can
be used to evaluate configurations and to help managers determine
the configuration which minimizes the tangible costs.

IV.3  Future Research Directions

The model can be extended. We can incorporate breakdown
costs due to failures of the computers and the databases. In order
to take the problem of reliability fully into account, breakdown costs
due to communication line malfunctions can be added. When used to
evaluate configurations related to the centralization-decentralization
issue, the model can include other costs such as personnel cost,
maintenance cost and replacement cost. Finally, an interactive
program can be developed using the present code. This will enable
managers to use it as a decision support system.